Creating a CCGbank and a Wide-Coverage CCG Lexicon for German
Abstract
We present an algorithm which creates a German CCGbank by translating the syntax graphs in the German Tiger corpus into CCG derivation trees. The resulting corpus contains 46,628 derivations, covering 95% of all complete sentences in Tiger. Lexicons extracted from this corpus contain correct lexical entries for 94% of all known tokens in unseen text.
Related papers
Extending CCGbank with Quotes and Multi-modal CCG
CCGbank is an automatic conversion of the Penn Treebank to Combinatory Categorial Grammar (CCG). We present two extensions to CCGbank which involve manipulating its derivation and category structure. We discuss approaches for the automatic re-insertion of removed quote symbols and evaluate their impact on the performance of the C&C CCG parser. We also analyse CCGbank to extract a multi-modal CC...
Hindi CCGbank: CCG Treebank from the Hindi Dependency Treebank
In this paper, we present an approach for automatically creating a Combinatory Categorial Grammar (CCG) treebank from a dependency treebank for the Subject-Object-Verb language Hindi. Rather than a direct conversion from dependency trees to CCG trees, we propose a two stage approach: a language independent generic algorithm first extracts a CCG lexicon from the dependency treebank. A determinis...
Punctuation Normalisation for Cleaner Treebanks and Parsers
Although punctuation is pervasive in written text, its treatment in parsers and corpora is often second-class. We examine the treatment of commas in CCGbank, a wide-coverage corpus for Combinatory Categorial Grammar (CCG), reanalysing its comma structures in order to eliminate a class of redundant rules, obtaining a more consistent treebank. We then eliminate these rules from C&C, a wide-cove...
Semi-supervised lexical acquisition for wide-coverage parsing
State-of-the-art parsers suffer from incomplete lexicons, as evidenced by the fact that they all contain built-in methods for dealing with out-of-lexicon items at parse time. Since new labelled data is expensive to produce and no amount of it will conquer the long tail, we attempt to address this problem by leveraging the enormous amount of raw text available for free, and expanding the lexicon...
Chinese CCGbank: extracting CCG derivations from the Penn Chinese Treebank
Automated conversion has allowed the development of wide-coverage corpora for a variety of grammar formalisms without the expense of manual annotation. Analysing new languages also tests formalisms, exposing their strengths and weaknesses. We present Chinese CCGbank, a 760,000 word corpus annotated with Combinatory Categorial Grammar (CCG) derivations, induced automatically from the Penn Chines...